This project aims to cluster restaurants in 10 metropolitan cities of North America into contiguous groups of geo-spatial locations. Insight into the interests of customers who review the restaurants of a particular cluster is then used to indicate supply and demand proportions for various categories of restaurants.
Index
Please click the links to jump to a specific area:
Import Libraries
Attribute Analysis
Attribute Selection
Data Clustering
User Data Preparation
Mapping Data Preparation
Data Visualization
Please click links below to access interactive diagrams and maps:
DBSCAN Min. Neighbors & Distance vs Coverage
DBSCAN Min. Neighbors & Distance vs Cluster Count
DBSCAN Min. Neighbors & Distance vs Largest Cluster Size
DBSCAN Label Count Histogram for Min. Neighbors & Distance
North America Clustered Restaurants by Location (All Categories)
All Clustered Restaurants on Sketch (Toronto)
All Clustered Restaurants on Map (Toronto)
Slider Controlled Categories Displaying Demand (Toronto)
Import required libraries
For some of the third party libraries, you may have to run 'pip install' commands, e.g.
- pip install geopy
- pip install shapely
- pip install matplotlib
- pip install plotly
- pip install cufflinks
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time, plotly.plotly as py, plotly.graph_objs as go
import multiprocessing as mp
from IPython.core.display import display, HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
from plotly import tools
from collections import Counter
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
init_notebook_mode(connected=True)
Read data from the file
The following files from Yelp Dataset will be used:
- yelp_academic_dataset_business.json : business location coordinates, city/state information, categories, etc.
- yelp_academic_dataset_review.json : user reviews, related business ids, etc.
The rest of the data in the dataset is not useful for the purposes of this project.
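The review file in particular is large. Since both files are newline-delimited JSON, they can also be streamed in chunks instead of loaded whole; a minimal sketch on a tiny in-memory stand-in (not the actual Yelp file):

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a newline-delimited (JSON Lines) review file
raw = io.StringIO(
    '{"business_id": "b1", "review_id": "r1", "user_id": "u1", "text": "good"}\n'
    '{"business_id": "b2", "review_id": "r2", "user_id": "u2", "text": "bad"}\n'
)
cols = ['business_id', 'review_id', 'user_id']

# chunksize with lines=True yields DataFrames of at most that many rows,
# so only the needed columns have to be kept in memory
chunks = pd.read_json(raw, lines=True, chunksize=1)
df = pd.concat(chunk[cols] for chunk in chunks)
print(df.shape)  # (2, 3)
```

The same pattern applies to the real file by passing its path instead of the StringIO object.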
business_file = "yelp_academic_dataset_business.json"
review_file = "yelp_academic_dataset_review.json"
start_time = time.time()
df_business_data_full = pd.read_json(business_file, lines=True)
df_review_data_full = pd.read_json(review_file, lines=True)
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Reveal datatypes and sample data rows from the dataset
1. Business data datatypes and attributes
df_business_data_full.info()
df_business_data_full.head(3)
2. Review data types and attributes
df_review_data_full.info()
df_review_data_full.head(3)
Reduce the attribute list to only the useful information for this project.
- From business data, [ business_id, latitude, longitude, city, neighborhood, state, postal_code, stars and categories ]
- From review data, [ business_id, review_id, user_id ]
start_time = time.time()
business_cols = ['business_id', 'latitude', 'longitude', 'city', 'neighborhood', \
'state', 'postal_code', 'stars', 'categories']
review_cols = ['business_id', 'review_id', 'user_id']
df_business_data = df_business_data_full.filter(business_cols , axis=1)
df_review_data = df_review_data_full.filter(review_cols , axis=1)
df_review_data.to_pickle('df_review_data.pkl')
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
df_business_data.head()
For the scope of this project, filter down to the businesses located within the US/Canada and those that have been categorized. This eliminates noise and the small number of restaurants that have not been categorized. Limiting the data to the US/Canada helps fit it within North American map coordinates while retaining the majority of the data.
Note: Business categories could be inferred based on user reviews, however, that is outside the scope of this project
start_time = time.time()
print(df_business_data.shape)
north_american_state_provinces = ['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', \
'DE', 'FL', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', \
'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', \
'MP', 'MS', 'MT', 'NA', 'NC', 'ND', 'NE', 'NH', 'NJ', \
'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', \
'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', \
'WI', 'WV', 'WY','AB', 'BC', 'MB', 'NB', 'NL', 'NT', \
'NS', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT']
df_business_data = df_business_data[df_business_data['state'].isin(north_american_state_provinces)]
df_business_data = df_business_data[df_business_data['categories'].notnull()]
df_business_data.to_pickle('df_business_data.pkl')
print(df_business_data.shape)
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
df_business_data.head()
Filter down to the list of businesses that are categorized as Restaurants
- Cleanup and build list of all categories
- Filter down to rows containing Restaurants as category
- Display number of rows before and after
start_time = time.time()
# create copy so that original business_data is intact
df_business_categorized_data = df_business_data.copy()
df_business_categorized_data['categories'] = df_business_data['categories'] \
.map(lambda x : (list(map(str.strip, x.split(',')))))
print('Total data rows and columns:{}'.format(df_business_categorized_data.shape))
df_restaurants = df_business_categorized_data[df_business_categorized_data['categories'] \
.map(lambda x : 'Restaurants' in x)]
print('Restaurant data rows and columns:{}'.format(df_restaurants.shape))
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Since we have dropped all rows that do not have Restaurants as a category, the dataframe must be re-indexed to fill the gaps.
df_restaurants = df_restaurants.reset_index(drop=True)
df_restaurants.head()
Display the list of unique states. We will use one of these states to cluster at the state level to make sense of the clustered data.
# list of states included in the dataset
df_restaurants.state.unique()
Top 20 Categories: Combine all category lists and count the top 50. The top 20 categories that represent actual food categories will be used for analysis.
start_time = time.time()
all_categories = df_restaurants['categories'].sum()
ct = Counter(all_categories)
top_50_categories = [x[0] for x in list(ct.most_common(50))]
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
print(top_50_categories)
We are calculating demand for specific categories of food. To limit the scope of this project, we choose the top 20 specific food categories that the businesses belong to:
- Sandwiches
- American (Traditional)
- Pizza
- Burgers
- Italian
- Mexican
- Chinese
- American (New)
- Japanese
- Chicken Wings
- Seafood
- Sushi Bars
- Canadian (New)
- Asian Fusion
- Mediterranean
- Steakhouses
- Indian
- Thai
- Vietnamese
- Middle Eastern
top_20_specific_categories = ['Sandwiches', 'American (Traditional)', 'Pizza', \
'Burgers', 'Italian', 'Mexican', 'Chinese', \
'American (New)', 'Japanese', 'Chicken Wings', \
'Seafood', 'Sushi Bars', 'Canadian (New)', \
'Asian Fusion', 'Mediterranean', 'Steakhouses', \
'Indian', 'Thai', 'Vietnamese', 'Middle Eastern']
len(top_20_specific_categories)
Category Reduction: Reduce categories of each business to only include top 20 categories. All categories other than the top 20 selected above are removed for optimization since they are not useful for the purpose of this analysis.
for idx, row in df_restaurants.iterrows():
categories = row['categories']
new_categories = list(set(categories) & set(top_20_specific_categories))
df_restaurants.at[idx, 'categories'] = new_categories
df_restaurants.head()
Getting Dummies: Create one column per category within the dataframe, with value 1 if that category applies to the business and 0 otherwise. This uses an approach similar to Get Dummies, which is often used in pandas for one-hot encoding.
df_category_flags = pd.DataFrame(0, index=np.arange(len(df_restaurants)), \
columns=top_20_specific_categories)
for index, row in df_restaurants.iterrows():
for category in row['categories']:
df_category_flags.at[index, category] = 1
pd.DataFrame(df_category_flags.sum(), columns=['Count'])
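For reference, pandas can build the same 0/1 matrix directly from the category lists with its string accessor; a small sketch on toy data (not the restaurant dataframe):

```python
import pandas as pd

# Toy stand-in for the 'categories' column of lists
df = pd.DataFrame({'categories': [['Pizza', 'Thai'], ['Thai'], ['Pizza', 'Sushi Bars']]})

# Join each category list into a delimited string, then one-hot encode it
flags = df['categories'].str.join('|').str.get_dummies(sep='|')
print(flags)
```

The row-by-row `.at` loop above is equivalent but also keeps all-zero columns for categories that never occur, which this one-liner would omit.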
Replace the category list for each restaurant with a binary flag per category within the restaurants data
df_restaurants_flagged = df_restaurants.join(df_category_flags)
print(len(df_restaurants_flagged))
df_restaurants_flagged.head()
Save the flagged restaurants so they can easily be loaded for analysis later
df_restaurants_flagged.to_pickle('df_restaurants_flagged.pkl')
df_supply_indicator_by_category = df_restaurants_flagged.filter(top_20_specific_categories).sum()
df_supply_indicator_by_category.to_frame('Supply (Restaurant Count)')
Explore parameters for DB SCAN clustering algorithm
- min_neighbors: the least number of businesses within epsilon of one another to declare them a cluster; swept from 3 to 20.
- epsilon: the distance limit for a neighboring business to be included within a particular cluster; swept from 50 to 1500 meters.
For each combination we record the number of points clustered (Coverage) and the number of clusters produced (Count).
kms_per_radian = 6371.0088
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_population_size_compare = pd.DataFrame(0, index=range(0,540), \
columns=['Minimum Neighbors','Epsilon(m)','Coverage','Count'])
start_mn = 3
end_mn = 20
start_eps = 50
end_eps = 1500
start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
for e in range(start_eps, end_eps+50, 50):
eps = e/1000/kms_per_radian
dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
        cluster_coverage = sum(dbscn.labels_ != -1)          # points assigned to any cluster
        cluster_count = sum(np.unique(dbscn.labels_) != -1)  # number of clusters, noise excluded
        compression = 100*(1 - float(cluster_count) / cluster_coverage)
df_population_size_compare.at[indx, 'Minimum Neighbors'] = mn
df_population_size_compare.at[indx, 'Epsilon(m)'] = e
df_population_size_compare.at[indx, 'Coverage'] = cluster_coverage
df_population_size_compare.at[indx, 'Count'] = cluster_count
        df_population_size_compare.at[indx, 'Compression'] = compression
indx = indx + 1
print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))
df_population_size_compare.head()
Cluster data using DB SCAN algorithm
df_population_size_compare.to_pickle('df_population_size_compare.pkl')
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')
df_population_size_compare['Compression'] = 100 * (1 - df_population_size_compare['Count']/df_population_size_compare['Coverage'])
df_population_size_compare.head()
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Coverage'
x_start = 3
x_end = 20
max_x = []
max_y = []
max_z = []
max_2x = []
max_2y = []
max_2z = []
for i in range(x_start, x_end+1):
# figure out the peak values line
df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
max_row = df[df[z_col] == df[z_col].max()]
max_x.append(max_row[x_col].values[0])
max_y.append(max_row[y_col].values[0])
max_z.append(max_row[z_col].values[0])
# find the second peak line
df = df_population_size_compare[df_population_size_compare[x_col]==i].reset_index(drop=True)
peak2df = df[df[y_col] <= 300]
max_2row = peak2df[peak2df[z_col] == peak2df[z_col].max()]
max_2x.append(max_2row[x_col].values[0])
max_2y.append(max_2row[y_col].values[0])
max_2z.append(max_2row[z_col].values[0])
x = df_population_size_compare[x_col].values
y = df_population_size_compare[y_col].values
z = df_population_size_compare[z_col].values
traces = []
traces.append(go.Scatter3d(
x=x,
y=y,
z=z,
mode='markers',
marker=dict(
size=6,
color=z,
colorscale='Jet',
opacity=0.8
),
showlegend=True,
name='Coverage'
))
# draw max line for z values
traces.append(go.Scatter3d(
z=max_z,
y=max_y,
x=max_x,
line=dict(
color='teal',
width = 4
),
mode='lines',
name='Max Counts Line'
))
# draw 2nd peak line for z values
traces.append(go.Scatter3d(
z=max_2z,
y=max_2y,
x=max_2x,
line=dict(
color='purple',
width = 4
),
mode='lines',
name='2nd Max Counts Line'
))
layout = go.Layout(
margin=dict(
l=0,
r=0,
b=50,
t=50
),
paper_bgcolor='#999999',
title='Clustered Points Coverage vs. Minimum Neighbors & Distance (meters)',
scene=dict(
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-.25),
eye=dict(x=1.25, y=1.25, z=1.25)
),
xaxis=dict( title= x_col),
yaxis=dict( title= y_col),
zaxis=dict( title= z_col)
),
font= dict(color='#ffffff')
)
fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_coverage">DBSCAN Min. Neighbors & Distance vs Coverage</a>'))
iplot(fig, filename='3d-scatter-colorscale')
From the 3D Scatter Heat Plot above, we can observe that the cluster Coverage (total number of points clustered) is inversely proportional to the Minimum Neighbors count: it is maximized at mn=x=3. Along the Maximum Distance Epsilon(e) axis it has 2 peaks: the first peak is at 550 meters, and the second lies between 50 and 350 meters for Minimum Neighbors values less than 6.
To further narrow down to ideal parameters, we will look at a Ribbon Plot with Minimum Neighbors on the X-axis and Maximum Distance Epsilon on the Y-axis against the number of clusters that resulted from clustering (Count).
The aim is to narrow down to a range where cluster count is maximized.
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Count'
x_start = 3
x_end = 20
y_start = 0
y_end = 30
traces = []
max_x = []
max_y = []
max_z = []
for i in range(x_start, x_end+1):
x = []
y = []
z = []
ci = int(255/18*i) # "color index"
df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
max_row = df[df[z_col] == df[z_col].max()]
max_x.append(max_row[x_col].values[0])
max_y.append(max_row[y_col].values[0])
max_z.append(max_row[z_col].values[0])
max_x.append(max_row[x_col].values[0] + 0.5)
max_y.append(max_row[y_col].values[0])
max_z.append(max_row[z_col].values[0])
for j in range(y_start, y_end):
x.append([i, i+.5])
y.append([df.loc[j,y_col], df.loc[j,y_col]])
z.append([df.loc[j,z_col], df.loc[j,z_col]])
traces.append(dict(
z=z,
x=x,
y=y,
colorscale=[ [i, 'rgb(255,%d,%d)'%(ci, ci)] for i in np.arange(0,1.1,0.1) ],
showscale=False,
type='surface'
))
# draw max line for z values
traces.append(go.Scatter3d(
z=max_z,
y=max_y,
x=max_x,
line=dict(
color='green',
width = 8
),
mode='lines',
name='Max Counts Line'
))
layout = go.Layout(
autosize=True,
height=500,
margin=go.layout.Margin(
l=0,
r=0,
b=0,
t=50,
pad=0
),
paper_bgcolor='#999999',
title='Clustered Ribbons of Cluster Count vs. on Minimum Neighbors & Distance (meters)',
scene=dict(
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-.25),
eye=dict(x=1.5, y=1.5, z=1.5)
),
xaxis=dict( title= x_col),
yaxis=dict( title= y_col),
zaxis=dict( title= z_col)
),
font= dict(color='#ffffff')
)
fig = { 'data':traces, 'layout': layout }
display(HTML('<a id="mn_e_count">DBSCAN Min. Neighbors & Distance vs Cluster Count</a>'))
iplot(fig, filename='ribbon-plot-python')
The Ribbon Chart above shows that the number of clusters is inversely proportional to the number of Minimum Neighbors; it peaks around mn = 3. For Epsilon, the maximum distance for including locations within a cluster, the cluster count peaks between values of 50 and 350.
We can observe that there is a convergence from both graphs (Coverage & Count) for ranges:
Minimum Neighbors : 3 - 6
Epsilon(e) : 50 - 350 meters
We will investigate only these ranges from here onwards.
df_population_dist_compare = pd.DataFrame(None, index=range(0,28), \
columns=['Minimum Neighbors','Epsilon(m)','Min','Max', 'Labels'])
start_mn = 3
end_mn = 6
start_eps = 50
end_eps = 350
start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
for e in range(start_eps, end_eps+50, 50):
eps = e/1000/kms_per_radian
dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
df = pd.DataFrame(dbscn.labels_, columns=['label'])
df_counts = df.groupby(['label']).size().reset_index(name='count')
df_counts = df_counts[(df_counts['label'] > -1) & (df_counts['count'] >= mn)]
labels = [x for x in dbscn.labels_ if x != -1] # all labels except -1
df_population_dist_compare.at[indx, 'Minimum Neighbors'] = mn
df_population_dist_compare.at[indx, 'Epsilon(m)'] = e
df_population_dist_compare.at[indx, 'Min'] = df_counts['count'].min()
df_population_dist_compare.at[indx, 'Max'] = df_counts['count'].max()
df_population_dist_compare.at[indx, 'Labels'] = labels
indx = indx + 1
print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))
df_population_dist_compare.head()
df_population_dist_compare.to_pickle('df_population_dist_compare.pkl')
df_population_dist_compare = pd.read_pickle('df_population_dist_compare.pkl')
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Max'
x = df_population_dist_compare[x_col].values
y = df_population_dist_compare[y_col].values
z = df_population_dist_compare[z_col].values
zmin = df_population_dist_compare[z_col].min()
zmax = df_population_dist_compare[z_col].max()
intensity = (df_population_dist_compare[z_col].values - zmin)/(zmax-zmin)
traces = []
traces.append(
go.Mesh3d(
x = x,
y = y,
z = z,
intensity = z,
opacity=0.6,
colorscale = 'Earth',
reversescale=True
)
)
layout = go.Layout(
title='Largest Cluster vs. Min Neighbors and Epsilon',
paper_bgcolor='#999999',
scene = dict(
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-.25),
eye=dict(x=-2, y=-.8, z=0.3)
),
xaxis=dict( title= x_col),
yaxis=dict( title= y_col),
zaxis=dict( title= z_col)
),
font= dict(color='#ffffff')
)
fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_largest_cluster">DBSCAN Min. Neighbors & Distance vs Largest Cluster Size</a>'))
iplot(fig, filename='max-3d-mesh')
numCols = 4
fig = tools.make_subplots(rows=7, cols=4)
idx = 0
for index, row in df_population_dist_compare.iterrows():
trace = go.Histogram(
x = row['Labels'],
name = "mn:{}<br>e:{}" \
.format(row['Minimum Neighbors'], row['Epsilon(m)'])
)
i,j = idx // numCols + 1, idx % numCols + 1
fig.append_trace(trace, i, j)
idx = idx + 1
for k in range(1, idx + 1):
    fig['layout']['xaxis' + str(k)]['tickformat'] = 's'
    fig['layout']['yaxis' + str(k)]['tickformat'] = 's'
fig['layout']['paper_bgcolor'] = '#999999'
fig['layout']['font']['color'] = '#ffffff'
fig['layout']['font']['size'] = 9
display(HTML('<a id="mn_e_histograms">DBSCAN Label Count Histogram for Min. Neighbors & Distance</a>'))
iplot(fig, filename='binning function')
Based on the histograms drawn above, the histogram with an Epsilon(e) distance of 100 meters and a Minimum Neighbors value of 4 gives our parameters of choice, for the following reasons:
- In the Cluster Count Ribbon Graph, it is on the maximum curve. It provides the highest number of clusters for a minimum of 4 neighbors.
- In the Coverage Scatter Graph, it is well above most other parameter combinations, below only the outliers (which would potentially include noise).
- It is in the lower (earth) range of the surface graph, which indicates that the maximum count of businesses in a cluster is kept small.
- Its histogram is the least skewed among the mn=4 runs, which means its clusters are more evenly distributed than at higher e values.
- We do not select mn=3, even though it has the most evenly distributed histograms, because it would not maximize the number of clusters.
Define parameters for DB SCAN clustering algorithm
- epsilon: [ 100 meters ] We set 100 meters as the distance limit for a neighboring business to be included within a particular cluster. This means that, as long as businesses lie within 100 meters of each other, they keep getting included in the same cluster.
- min_neighbors: [ 4 ] The least number of businesses within 100 meters of one another required to declare them a cluster. We eliminate clusters with fewer businesses than the min_neighbors threshold to reduce noise.
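As a sanity check of the unit conversion, an epsilon given in meters is divided by 1000 and by the Earth's mean radius to obtain the radian units the haversine metric expects; a minimal sketch on synthetic coordinates (not the Yelp data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088
eps = 100 / 1000 / kms_per_radian  # 100 meters expressed in radians

# Five points within ~50 m of each other plus one far-away noise point
coords = np.array([
    [43.6543, -79.3860],
    [43.6544, -79.3861],
    [43.6545, -79.3859],
    [43.6543, -79.3858],
    [43.6542, -79.3862],
    [44.0000, -80.0000],
])
db = DBSCAN(eps=eps, min_samples=4, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coords))
print(db.labels_)  # the five nearby points share one cluster; the last point is noise (-1)
```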
epsilon = 0.1 / kms_per_radian
min_neighbors = 4
start_time = time.time()
fd_coordinates = pd.read_pickle('fd_coordinates.pkl')
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
dbscn = DBSCAN(eps=epsilon, min_samples=min_neighbors, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
cluster_labels = dbscn.labels_
print(dbscn)
num_clusters = len(set(cluster_labels))
message = ' Total points clustered: {:,} \n Number of clusters: {:,} \n Compression ratio: {:.1f}% \n Time taken: {:,.2f} seconds'
print(message.format(len(fd_coordinates), num_clusters, \
100*(1 - float(num_clusters) / len(fd_coordinates)), time.time()-start_time))
fd_cluster_labels = pd.DataFrame(cluster_labels, columns=['label'])
print('Number of labels:{}'.format(len(cluster_labels)))
fd_cluster_labels.to_pickle('fd_cluster_labels.pkl')
fd_cluster_labels.head()
# Join cluster labels with the original dataset of the restaurants
df_restaurants_labeled = df_restaurants_flagged.join(pd.DataFrame(fd_cluster_labels))
# Filter out clusters that do not qualify requirements of minimum neighbors
df_rst_lbl_grouped = df_restaurants_labeled.groupby(['label']).size().reset_index(name='count')
df_lbl_counts = df_rst_lbl_grouped[(df_rst_lbl_grouped['label'] > -1) \
& (df_rst_lbl_grouped['count'] >= min_neighbors)].set_index('label')
# Remove all restaurants that were not labeled
df_restaurants_label_filtered = df_restaurants_labeled.join(df_lbl_counts, on='label', how='inner')
df_restaurants_labeled.to_pickle('df_restaurants_labeled.pkl')
print(len(df_restaurants_label_filtered))
df_restaurants_label_filtered.to_pickle('df_restaurants_label_filtered.pkl')
df_restaurants_label_filtered.head()
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_review_data = pd.read_pickle('df_review_data.pkl')
df_reviews_and_restaurants = df_review_data.join(df_restaurants_label_filtered.set_index('business_id'), \
on='business_id', how='inner')
print(len(df_reviews_and_restaurants))
df_reviews_and_restaurants.head()
Group each user's reviews by restaurant category. The higher the count of reviews for a certain category, the more likely the user is to visit that category of restaurant.
df_user_rst_visits = df_reviews_and_restaurants.filter(['user_id'] + top_20_specific_categories , axis=1) \
.groupby(['user_id']).sum()
df_user_rst_visits.to_pickle('df_user_rst_visits.pkl')
df_user_rst_visits.head()
Restaurant/Review Count Ratio: The more users review restaurants of a particular category, the more interested they are in eating that particular kind of food. Thus the overall review count of a restaurant category indicates the interest of users in that category of food and restaurants.
Across the entire population, an equilibrium should exist between the review count indicating desire for a particular restaurant's food type (let's call it the Demand Indicator) and the number of reviewed restaurants of that category that cater to that demand (the Supply Indicator).
We can calculate the ratio of the number of restaurants to the number of reviews for each category to find the rate at which user interest translates into restaurant count for that category in the overall population.
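As a toy illustration (made-up counts, not the actual Yelp figures), the ratio is an element-wise division of per-category restaurant counts by per-category review counts:

```python
import pandas as pd

supply = pd.Series({'Pizza': 200, 'Sushi Bars': 50})     # restaurants per category
demand = pd.Series({'Pizza': 8000, 'Sushi Bars': 1000})  # reviews per category

# Restaurants per review: how strongly interest converts into supply
ratio = supply / demand
print(ratio['Pizza'], ratio['Sushi Bars'])  # 0.025 0.05
```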
df_demand_indicator_by_category = df_user_rst_visits.sum()
df_demand_indicator_by_category.to_frame('Demand (Review Count)')
review_restaurant_ratio = df_supply_indicator_by_category/df_demand_indicator_by_category
df_restaurant_review_ratio = review_restaurant_ratio.to_frame('Supply/Demand (Restaurant/Review) Ratio')
df_restaurant_review_ratio
Save the supply/demand ratio indicator for each category of restaurant
df_restaurant_review_ratio.to_pickle('df_restaurant_review_ratio.pkl')
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_user_rst_visits = pd.read_pickle('df_user_rst_visits.pkl')
df_restaurant_review_ratio = pd.read_pickle('df_restaurant_review_ratio.pkl')
fd_coordinates = pd.read_pickle('fd_coordinates.pkl')
fd_cluster_labels = pd.read_pickle('fd_cluster_labels.pkl')
df_restaurants_labeled = pd.read_pickle('df_restaurants_labeled.pkl')
gb = df_restaurants_label_filtered.groupby(['label'])
df_clust_group_info = pd.DataFrame({'size': gb.size()})
df_bus_reviews = df_reviews_and_restaurants.set_index('business_id')
df_restaurant_review_ratio_tps = df_restaurant_review_ratio.transpose()
start_time = time.time()
def get_group_info(cur_cluster):
groupSize = len(cur_cluster)
df_clust_group_info.at[cur_cluster.name, 'size'] = groupSize
df_clust_group_info.at[cur_cluster.name, 'latitude'] = cur_cluster['latitude'].sum()/groupSize
df_clust_group_info.at[cur_cluster.name, 'longitude'] = cur_cluster['longitude'].sum()/groupSize
df_clust_group_info.at[cur_cluster.name, 'city'] = pd.Series(cur_cluster['city'].unique()).str.cat(sep=', ')
df_clust_group_info.at[cur_cluster.name, 'zip'] = pd.Series(cur_cluster['postal_code'].unique()).str.cat(sep=', ')
df_clust_group_info.at[cur_cluster.name, 'neighborhood'] = pd.Series(cur_cluster['neighborhood'].unique()).str.cat(sep=', ')
df_cur_cluster_reviews = cur_cluster[['business_id']].join(df_bus_reviews, on='business_id', how='inner')
df_cur_cluster_unique_users = df_cur_cluster_reviews[['user_id']].drop_duplicates()
df_clust_user_rst_visits = df_cur_cluster_unique_users.join(df_user_rst_visits, on='user_id')
df_clust_group_info.at[cur_cluster.name, 'reviews_count'] = len(df_cur_cluster_reviews)
df_clust_group_info.at[cur_cluster.name, 'user_count'] = len(df_cur_cluster_unique_users)
for category in top_20_specific_categories:
df_clust_group_info.at[cur_cluster.name, category + ' Supply'] = cur_cluster[category].sum()
df_clust_group_info.at[cur_cluster.name, category + ' Demand'] = df_clust_user_rst_visits[category].sum() \
* df_restaurant_review_ratio_tps.loc['Supply/Demand (Restaurant/Review) Ratio',category]
print('Time taken: {:,.2f} minutes - Group # {}'.format((time.time()-start_time)/60, cur_cluster.name))
gb.apply(get_group_info)
df_clust_group_info.head()
df_clust_group_info.to_pickle('df_clust_group_info.pkl')
For the denser cluster population, an interval scale with 20 limits is used to indicate the size of each cluster, with the same number of colors so that each cluster size band is easy to recognize on the map.
df_grouped_cluster_data = pd.read_pickle('df_clust_group_info.pkl')
mapbox_access_token = 'pk.eyJ1IjoiZjhheml6IiwiYSI6ImNqb3plOWp6MjA0bXIzcnFxczZ1bjdrbmwifQ.5qd5W4B06UUZc20Jax12OA'
# interval_20 = pd.interval_range(start=4, periods=20, freq=2, closed='both').to_tuples()
limits_20 = [(4,5),(6,10),(11,15),(16,20),(21,25),(26,30),(31,35),(36,40),(41,45),(46,50),(51,60), \
(61,70),(71,80),(81,100),(101,150),(151,200),(201,300),(301,400),(401,1000),(1001,2000)]
colors_20 = ['RGB(230,25,75)','RGB(60,180,75)','RGB(255,225,25)','RGB(67,99,216)','RGB(245,130,49)', \
'RGB(145,30,180)','RGB(70,240,240)','RGB(240,50,230)','RGB(188,246,12)','RGB(250,190,190)', \
'RGB(0,128,128)', 'RGB(230,190,255)','RGB(154,99,36)','RGB(255,250,200)','RGB(170,255,195)', \
'RGB(255,216,177)','RGB(0,0,117)','RGB(128,128,128)','RGB(128,0,0)','RGB(128,128,0)']
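The commented-out pd.interval_range hints at an alternative: instead of scanning the tuple list manually, pandas can assign each cluster size to its bin directly; a sketch with a few hypothetical sizes and three of the bins:

```python
import pandas as pd

# A few hypothetical cluster sizes and three illustrative size bins
sizes = pd.Series([4, 18, 450])
bins = pd.IntervalIndex.from_tuples([(3, 5), (5, 20), (400, 2000)], closed='right')

# pd.cut maps each size onto the interval that contains it
binned = pd.cut(sizes, bins)
print(binned)
```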
For the city level map, 10 interval limits and 10 colors are used to indicate each cluster's size band
label_sizes = df_restaurants_label_filtered[['business_id','label']].groupby(['label']).agg(['count'])
label_sizes['business_id']['count'].nlargest(10)
#interval_10 = pd.interval_range(start=4, periods=10, freq=4, closed='both').to_tuples()
limits_10 = [(4,10),(11,20),(21,30),(31,40),(41,50),(51,70),(71,100),(101,200),(201,400),(401,2000)]
colors_10 = ['#0000FF', '#008080', '#FF0000', '#008000', '#808000', '#000080', '#C36900', \
'#FF00FF', '#800080','#00FF00']
df_grouped_cluster_data.head()
df_grouped_cluster_data['label'] = df_grouped_cluster_data.index
for index,row in df_grouped_cluster_data.iterrows():
    nh, zp = row['neighborhood'], row['zip']
    df_grouped_cluster_data.at[index, 'neighborhood'] = nh[:50] + ('...' if len(nh) > 50 else '')
    df_grouped_cluster_data.at[index, 'zip'] = zp[:50] + ('...' if len(zp) > 50 else '')
clusters = []
scale = 1
for i in range(len(limits_20)):
lim = limits_20[i]
df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
& (df_grouped_cluster_data['size'] <= lim[1]))]
cluster = dict(
type = 'scattergeo',
locationmode = 'USA-states',
lon = df_sub['longitude'],
lat = df_sub['latitude'],
text = 'City: ' + df_sub['city'] + \
'<br>Neighborhood(s): ' + df_sub['neighborhood'] + \
'<br> Zip/Postal Code(s):' + df_sub['zip'],
sizemode = 'diameter',
marker = dict(
        size = [(i+1)*scale]*len(df_sub),
color = colors_20[i],
line = dict(width = 2,color = 'black')
),
name = '{0} - {1}'.format(lim[0],lim[1]) )
clusters.append(cluster)
layout = dict(
title = 'Yelp Reviewed Restaurants in North America',
showlegend = True,
geo = dict(
scope='north america',
projection=dict( type='albers usa canada' ),
resolution= 50,
lonaxis= {
'range': [-150, -55]
},
lataxis= {
'range': [30, 50]
},
center=dict(
lat=43.6543,
lon=-79.3860
),
showland = True,
landcolor = 'rgb(217, 217, 217)',
subunitwidth=1,
countrywidth=1,
subunitcolor="rgb(255, 255, 100)",
countrycolor="rgb(255, 200, 255)"
),
)
fig = dict( data=clusters, layout=layout )
display(HTML('<a id="north_america_clustered">North America Clustered Restaurants by Location (All Categories)</a>'))
iplot( fig, validate=False, filename='d3-bubble-map-populations' )
clusters = []
scale = 3
for i in range(len(limits_10)):
lim = limits_10[i]
df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
& (df_grouped_cluster_data['size'] <= lim[1]))]
cluster = dict(
type = 'scattergeo',
locationmode = 'USA-states',
lon = df_sub['longitude'],
lat = df_sub['latitude'],
text = 'City: ' + df_sub['city'] + \
'<br>Size: ' + df_sub['size'].astype(str) + \
'<br>Neighborhood: ' + df_sub['neighborhood'] + \
'<br>Postal Code:' + df_sub['zip'],
sizemode = 'diameter',
marker = dict(
        size = [(i+1)*scale]*len(df_sub),
color = colors_10[i],
line = dict(width = 2,color = 'black')
),
name = '{0} - {1}'.format(lim[0],lim[1]) )
clusters.append(cluster)
layout = dict(
title = 'Yelp Reviewed Clustered Restaurants in Toronto',
showlegend = True,
geo = dict(
scope='north america',
projection=dict( type='albers usa canada', scale=500 ),
resolution= 50,
lonaxis= {
'range': [-130, -55]
},
lataxis= {
'range': [30, 50]
},
center=dict(
lat=43.6543,
lon=-79.3860
),
showland = True,
landcolor = 'rgb(217, 217, 217)',
subunitwidth=1,
countrywidth=1,
subunitcolor="rgb(120, 120, 120)",
countrycolor="rgb(255, 255, 255)"
),
)
fig = dict( data=clusters, layout=layout )
display(HTML('<a id="toronto_clustered">All Clustered Restaurants on Sketch (Toronto)</a>'))
iplot( fig, validate=False, filename='d3-bubble-map-populations' )
clusters = []
scale = 4
for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                      & (df_grouped_cluster_data['size'] <= lim[1]))]
    cluster = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'Cluster #: ' + df_sub['label'].astype(str) + \
               '<br>Size: ' + df_sub['size'].astype(str) + \
               '<br>City: ' + df_sub['city'] + \
               '<br>Neighborhood: ' + df_sub['neighborhood'] + \
               '<br>Postal Code: ' + df_sub['zip'],
        mode = 'markers',
        marker = dict(
            size = [(i + 1) * scale] * len(df_sub),  # i + 1 keeps the smallest bucket visible
            color = colors_10[i]
        ),
        name = '[{0} - {1}]'.format(lim[0], lim[1]) )
    # a slightly larger black marker trace, appended first, renders underneath as a border
    border = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        mode = 'markers',
        marker = dict(
            size = [(i + 1) * scale + 1] * len(df_sub),
            color = 'black',
            opacity = 0.4
        ),
        hoverinfo = 'none',
        showlegend = False)
    clusters.append(border)
    clusters.append(cluster)
layout = go.Layout(
title = 'Yelp Reviewed Clustered Restaurants on Toronto Map',
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=43.6543,
lon=-79.3860
),
pitch=0,
zoom=12
),
)
fig = dict(data=clusters, layout=layout)
display(HTML('<a id="toronto_clustered_map">All Clustered Restaurants on Map (Toronto)</a>'))
iplot(fig, filename='Multiple Mapbox')
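The border effect in the map above comes purely from trace ordering: for each bucket the black "border" trace is appended to the data list before the colored trace, and Plotly draws traces in list order, so the larger black marker ends up underneath. A toy sketch of that interleaving (the bucket names are hypothetical):

```python
# Sketch: interleave border/marker entries so borders render underneath.
# 'buckets' stands in for the limits_10 size buckets used in the notebook.
buckets = ['tiny', 'small', 'large']
traces = []
for b in buckets:
    traces.append({'name': b, 'kind': 'border'})  # appended first -> drawn underneath
    traces.append({'name': b, 'kind': 'marker'})  # appended second -> drawn on top
kinds = [t['kind'] for t in traces]
print(kinds)  # ['border', 'marker', 'border', 'marker', 'border', 'marker']
```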
demand_cats = [x + ' Demand' for x in top_20_specific_categories]
supply_cats = [x + ' Supply' for x in top_20_specific_categories]
local_demand_cats = [x + ' Local Demand' for x in top_20_specific_categories]
display_cats = [x + ' Display' for x in top_20_specific_categories]
# Add new Local Demand and Display columns for each category, initialized to NaN
for col in local_demand_cats + display_cats:
    df_grouped_cluster_data[col] = np.nan
scaler = MinMaxScaler(feature_range=(0, 1))
for index, row in df_grouped_cluster_data.iterrows():
    cluster_supply = row[supply_cats].transpose().sum()
    cluster_demand = row[demand_cats].transpose().sum()
    cluster_adjustment_ratio = (cluster_supply / cluster_demand) if cluster_demand > 0 else 0
    for x in top_20_specific_categories:
        localDemand = round(row[x + ' Demand'] * cluster_adjustment_ratio)
        df_grouped_cluster_data.at[index, x + ' Local Demand'] = localDemand
    # apply the (n - min)/(max - min) formula to the (Supply - Local Demand) differences
    # to normalize them into [0, 1] for use as display opacity
    diff = row[supply_cats].values - df_grouped_cluster_data.loc[index, local_demand_cats].values
    scaled = scaler.fit_transform(diff.astype('float64').reshape(-1, 1))
    for i in range(len(diff)):
        df_grouped_cluster_data.at[index, top_20_specific_categories[i] + ' Display'] = scaled[i, 0]
df_grouped_cluster_data[display_cats].head()
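The local-demand and normalization arithmetic above can be illustrated on toy numbers. This is only a sketch of the same computation (the supply/demand values are made up), using plain NumPy in place of `MinMaxScaler`:

```python
import numpy as np

# Toy cluster: per-category supply and global demand counts (hypothetical values)
supply = np.array([8.0, 3.0, 1.0])
demand = np.array([40.0, 10.0, 50.0])

# Scale global demand down to "local demand" by the cluster's supply/demand ratio
ratio = supply.sum() / demand.sum()      # 12 / 100 = 0.12
local_demand = np.round(demand * ratio)  # [5., 1., 6.]

# Min-max normalize (supply - local_demand) into [0, 1] for display opacity
diff = supply - local_demand             # [3., 2., -5.]
scaled = (diff - diff.min()) / (diff.max() - diff.min())
print(local_demand)  # [5. 1. 6.]
print(scaled)        # [1.    0.875 0.   ]
```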
clusters = []
scale = 4
colors = ['maroon', 'purple', 'navy', 'teal', 'olive']
for x in range(0, len(top_20_specific_categories)):
    cat = top_20_specific_categories[x]
    for i in range(len(limits_10)):
        lim = limits_10[i]
        df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                          & (df_grouped_cluster_data['size'] <= lim[1]))]
        cluster = go.Scattermapbox(
            lon = df_sub['longitude'],
            lat = df_sub['latitude'],
            text = 'Category: {}'.format(cat) + \
                   '<br>Size: ' + df_sub['size'].astype(str) + \
                   '<br>City: ' + df_sub['city'] + \
                   '<br>Demand: ' + df_sub['{} Local Demand'.format(cat)].astype(str) + \
                   '<br>Supply: ' + df_sub['{} Supply'.format(cat)].astype(str) + \
                   '<br>Neighborhood: ' + df_sub['neighborhood'] + \
                   '<br>Postal Code: ' + df_sub['zip'],
            mode = 'markers',
            marker = dict(
                size = [(i + 1) * scale] * len(df_sub),  # i + 1 keeps the smallest bucket visible
                color = colors[x % 5],
                opacity = df_sub['{} Display'.format(cat)]
            ),
            name = '[{0} - {1}]'.format(lim[0], lim[1]),
            visible = (x == 0)  # only the first category shows until the slider changes it
        )
        clusters.append(cluster)
# add a faint, always-visible border trace for all clusters
for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_grouped_cluster_data[((df_grouped_cluster_data['size'] >= lim[0]) \
                                      & (df_grouped_cluster_data['size'] <= lim[1]))]
    border = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        mode = 'markers',
        marker = dict(
            size = [(i + 1) * scale + 1] * len(df_sub),
            color = 'black',
            opacity = 0.1
        ),
        hoverinfo = 'none',
        visible = True,
        showlegend = False)
    clusters.append(border)
steps = []
trc_count = len(limits_10)  # traces per category (one per size bucket)
category_size = len(top_20_specific_categories)
# base visibility mask: hide every category trace, keep the trailing border traces visible
v = [False] * trc_count * category_size + [True] * trc_count
for i in range(0, category_size):
    step = dict(method='restyle',
                args=['visible', v[0:i * trc_count] + [True] * trc_count + v[(i + 1) * trc_count: len(v)]],
                label='{}'.format(top_20_specific_categories[i]))
    steps.append(step)
sliders = [dict(active=0,
pad={"t": 1},
steps=steps)]
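The slider steps rely on a flat visibility mask over the trace list: with T traces per category, plus T always-visible border traces at the end, step i turns on only its own category's slice. A self-contained sketch of that mask logic (toy sizes, not the notebook's actual trace counts):

```python
# Toy setup: 3 categories x 2 traces each, plus 2 border traces at the tail
trc_count, category_size = 2, 3
base = [False] * trc_count * category_size + [True] * trc_count

def step_mask(i):
    # turn on only category i's slice; the border traces at the tail stay True
    return base[:i * trc_count] + [True] * trc_count + base[(i + 1) * trc_count:]

print(step_mask(1))
# [False, False, True, True, False, False, True, True]
```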
layout = go.Layout(
title = 'Yelp Reviewed Restaurants Supply/Demand by Category Slider',
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=43.6543,
lon=-79.3860
),
pitch=0,
zoom=12
),
sliders = sliders
)
fig = dict(data=clusters, layout=layout)
display(HTML('<a id="toronto_clustered_categorized">Slider Controlled Categories Displaying Demand (Toronto)</a>'))
iplot(fig, filename='Multiple Mapbox')